{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "<a href=\"https://colab.research.google.com/github/milvus-io/bootcamp/blob/master/bootcamp/RAG/advanced_rag/sentence_window_with_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "# Sentence window"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Google Colab preparation[optional]\n",
    "This is an optional step, if you want to run this notebook on Google Colab."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "! git clone https://github.com/milvus-io/bootcamp.git"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import shutil\n",
    "src_dir = \"./bootcamp/bootcamp/RAG/advanced_rag/rag_utils\"\n",
    "dst_dir = \"./rag_utils\"\n",
    "shutil.copytree(src_dir, dst_dir)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "! pip install --upgrade langchain langchain-community langchain_milvus langchain-openai bs4"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Please prepare you [OPENAI_API_KEY](https://openai.com/index/openai-api/) in your environment variables.\n",
    "![](imgs/colab_api_key1.png)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### If you are running this notebook on Google Colab, you have to restart this session by `Cmd/Ctrl + M`, then press `.` to make the environment take effect."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from google.colab import userdata\n",
    "import os\n",
    "\n",
    "os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "----\n",
    "## Get started\n",
    "![](imgs/sentence_window.png)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Prepare the data\n",
    "\n",
    "We use the Langchain WebBaseLoader to load documents from [blog sources](https://lilianweng.github.io/posts/2023-06-23-agent/) and split them into chunks using the RecursiveCharacterTextSplitter."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import bs4\n",
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "from rag_utils.vanilla import vectorstore\n",
    "\n",
    "# Create a WebBaseLoader instance to load documents from web sources\n",
    "loader = WebBaseLoader(\n",
    "    web_paths=(\"https://lilianweng.github.io/posts/2023-06-23-agent/\",),\n",
    "    bs_kwargs=dict(\n",
    "        parse_only=bs4.SoupStrainer(\n",
    "            class_=(\"post-content\", \"post-title\", \"post-header\")\n",
    "        )\n",
    "    ),\n",
    ")\n",
    "# Load documents from web sources using the loader\n",
    "documents = loader.load()\n",
    "# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "\n",
    "# Split the documents into chunks using the text_splitter\n",
    "docs = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "We add the wider window text information to each chunk using the `write_wider_window` function. We can see the metadata of the document contains a `wider_text` field, which is the wider window text information."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Document(page_content='Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.\\nLong-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\\n\\n\\nTool use\\n\\nThe agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 979, 'end_index': 1579, 'wider_text': 't System Overview#\\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\\n\\nPlanning\\n\\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\\n\\n\\nMemory\\n\\nShort-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.\\nLong-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\\n\\n\\nTool use\\n\\nThe agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\\n\\n\\n\\n\\nFig. 1. Overview of a LLM-powered autonomous agent system.\\nComponent One: Planning#\\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\\nTask Decomposition#\\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms '}),\n",
       " 63)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from rag_utils.sentence_window import write_wider_window\n",
    "\n",
    "write_wider_window(docs, documents[0], offset=500)\n",
    "\n",
    "docs[1], len(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Build the chain\n",
    "\n",
    "We load the docs into milvus vectorstore, and build a milvus retriever."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "vectorstore.add_documents(docs)\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Define the vanilla RAG chain."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from rag_utils.vanilla import format_docs, rag_prompt, llm\n",
    "\n",
    "vanilla_rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Define the sentence window chain."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from rag_utils.sentence_window import format_docs_with_wider_window\n",
    "\n",
    "sentence_window_chain = (\n",
    "    {\n",
    "        \"context\": retriever | format_docs_with_wider_window,\n",
    "        \"question\": RunnablePassthrough(),\n",
    "    }\n",
    "    | rag_prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Test the chain\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[vanilla_result]:\n",
      "The context provided does not list different types of Artificial Neural Network (ANN) algorithms. However, it does mention different algorithms used for Approximate Nearest Neighbors (ANN) search, which are LSH (Locality-Sensitive Hashing), ANNOY (Approximate Nearest Neighbors Oh Yeah), FAISS (Facebook AI Similarity Search), and ScaNN (Scalable Nearest Neighbors).\n",
      "\n",
      "[sentence_window_result]:\n",
      "The different types of Approximate Nearest Neighbors (ANN) algorithms mentioned in the context are:\n",
      "\n",
      "1. LSH (Locality-Sensitive Hashing): This algorithm uses a hashing function to map similar input items to the same buckets with high probability.\n",
      "\n",
      "2. ANNOY (Approximate Nearest Neighbors Oh Yeah): This algorithm uses random projection trees, a set of binary trees where each non-leaf node represents a hyperplane splitting the input space into half and each leaf stores one data point.\n",
      "\n",
      "3. HNSW (Hierarchical Navigable Small World): This algorithm is inspired by the idea of small world networks where most nodes can be reached by any other nodes within a small number of steps.\n",
      "\n",
      "4. FAISS (Facebook AI Similarity Search): This algorithm operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points.\n",
      "\n",
      "5. ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point to such that the inner product is as similar to the original distance as possible, instead of picking the closet quantization centroid points.\n"
     ]
    }
   ],
   "source": [
    "query = \"what are the different types of ANN algorithms?\"\n",
    "\n",
    "vanilla_result = vanilla_rag_chain.invoke(query)\n",
    "sentence_window_result = sentence_window_chain.invoke(query)\n",
    "print(\n",
    "    f\"[vanilla_result]:\\n{vanilla_result}\\n\\n[sentence_window_result]:\\n{sentence_window_result}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "We can see that the sentence window results include `HNSW`, but the vanilla results do not. This is because the context window of the sentence window is wider, so it can contain more relevant content fragments to reduce the loss of information."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}